Please note that there is a file on Canvas called Getting started with R which may be of some use. This provides details of setting up R and Rstudio on your own computer as well as providing an overview of inputting and importing various data files into R. This should mainly serve as a reminder.
Recall that we can clear the environment using rm(list=ls()) It is advisable to do this before attempting new questions if confusion may arise with variable names etc.
Example 1
In this example we want to determine whether there is any difference between the driving reaction times after drinking water, coffee and alcohol in the Kruskal-Wallis (Reaction Times) dataset. Please follow the steps below:
Firstly, we input the dataset into R and check it using the ``head” command (recall this provides the first 6 entries). We use the gdata package to import the Excel file.
We next perform a test of normality on the standardized residuals to determine which method should be used. If the residuals are normally distributed and we have homogeneity of variances (Levene’s test) then we would use ANOVA. The following code creates and performs tests of normality on the residuals.
Shapiro-Wilk normality test
data: residuals
W = 0.89219, p-value = 0.005437
library(nortest)lillie.test(residuals)
Lilliefors (Kolmogorov-Smirnov) normality test
data: residuals
D = 0.14947, p-value = 0.08544
Clearly the residuals are not normally distributed, therefore we must use the nonparametric equivalent of the ANOVA procedure, i.e. the Kruskal-Wallis H-test. In order to determine the correct hypotheses, we next investigate the shapes of the distributions of the groups using histograms:
DrinkTypes<-factor(Reaction$"Drink Groups",c(1,2,3),labels=c('Water','Coffee', 'Alcohol'))par(mfrow=c(1,3)) hist(Reaction$"Reaction Times"[DrinkTypes=='Water'], xlab='Reaction Time', main='Histogram for Water', breaks=seq(from=0,to=4, by=0.2)) hist(Reaction$"Reaction Times"[DrinkTypes=='Coffee'], xlab='Reaction Time', main='Histogram for Coffee', breaks=seq(from=0,to=4, by=0.2))hist(Reaction$"Reaction Times"[DrinkTypes=='Alcohol'], xlab='Reaction Time', main='Histogram for Alcohol', breaks=seq(from=0,to=4, by=0.2))
From the histograms we see that the groups seem to have similar shaped distributions, therefore use the following hypotheses:
\(H_0:\) The medians of all groups are equal;
\(H_1:\) At least one group has median not equal to the others.
You should see that there is a significant difference between drinks 1 and 3, i.e. between water and alcohol. We therefore report medians using the following code:
The Diets dataset contains data on the weight loss of 3 groups of participants who took park in 3 different diets. Determine whether there is any difference in the effectiveness of the diets, fully justifying the method used.
Solution
We first import the data.
Diets<-read.csv(file.choose(), header = T)
In order to determine whether we need to use a parametric or non-parametric technique, we investigate the normality of the residuals.
Shapiro-Wilk normality test
data: residuals2
W = 0.89376, p-value = 0.04479
library(nortest)lillie.test(residuals2)
Lilliefors (Kolmogorov-Smirnov) normality test
data: residuals2
D = 0.27639, p-value = 0.00078
The results indicate that the residuals do not appear to be normally distributed, hence we use the Kruskal-Wallis test (as we have 3 independent groups).
In order to identify the correct hypotheses, we need to investigate the shape of the distributions of the groups.
par(mfrow=c(1,3)) hist(Diets$Weight.Loss[Diets$Diet==1], xlab='Weight Loss', main='Histogram for Diet 1', breaks=seq(from=-10,to=10, by=2)) hist(Diets$Weight.Loss[Diets$Diet==2], xlab='Weight Loss', main='Histogram for Diet 2', breaks=seq(from=-10,to=10, by=2))hist(Diets$Weight.Loss[Diets$Diet==3], xlab='Weight Loss', main='Histogram for Diet 3', breaks=seq(from=-10,to=10, by=2))
From the histograms we see that the groups seem to have similar shaped distributions, therefore we will compare medians in the Kruskal-Wallis test.
kruskal.test(Diets$Weight.Loss~Diets$Diet)
Kruskal-Wallis rank sum test
data: Diets$Weight.Loss by Diets$Diet
Kruskal-Wallis chi-squared = 6.5272, df = 2, p-value = 0.03825
As we have a significant result we undertake a Dunn post hoc test with Bonferroni correction.
Next we check the normality of the residuals using the following code. We first need to upload the dataset Work Lateness.csv, which in a different format. (It is not as straightforward as in the Kruskal-Wallis case.)
Shapiro-Wilk normality test
data: residuals
W = 0.94283, p-value = 0.2478
The residuals seem to be normally distributed, however we continue with the Friedman test as the sample size is small (and we will see shortly that the data are skewed). We next investigate the shapes of the group to determine the correct hypotheses.
par(mfrow=c(1,3)) hist(Baseline, xlab='Late Days', main='Histogram for Baseline', breaks=seq(from=0,to=20, by=2)) hist(Month1, xlab='Late Days', main='Histogram for Month 1', breaks=seq(from=0,to=20, by=2))hist(Month2, xlab='Late Days', main='Histogram for Month 2', breaks=seq(from=0,to=20, by=2))
We see that the groups seem to have the same shaped distribution, hence we compare medians in the hypotheses. The histograms are all skewed, providing further evidence that non parametric techniques are not suitable. We now proceed with Friedman’s test. Note that we have to make a matrix out of the data to perform the test.
Pairwise comparisons using Tukey-Kramer-Nemenyi all-pairs test with Tukey-Dist approximation
data: LatenessNew$Lateness by LatenessNew$Bonus
0 1
1 0.0794 -
2 0.0072 0.6646
P value adjustment method: single-step
alternative hypothesis: two.sided
You should come to the conclusion from this output that there is a significant difference between Baseline and Month2. We next report their medians using the code:
median(Baseline)
[1] 11
median(Month2)
[1] 5
Exercise 2
Determine whether there are any differences between the groups Overall score (Video A), Overall score (Video B), Overall score (Video C) and Overall score (Demo D) in the Public Awareness SPSS dataset.
Hint: Before performing the Friedman test you will need to create a matrix of the relevant variables. The following code may help you to do this:
You will then need to create another dataset with one column containing all the awareness scores and a corresponding label column. I would suggest exporting your matrix created above using the write.csv() command and making these changes in Excel, then re-import your new dataset.
We assume that the residuals are not normally distributed.
We next investigate the shapes of the group to determine the correct hypotheses.
par(mfrow=c(1,4)) hist(Awareness$TotalAGen, xlab='Overall Score', main='Histogram for Video A', breaks=seq(from=5,to=30, by=4)) hist(Awareness$TotalBdoc, xlab='Overall Score', main='Histogram for Video B', breaks=seq(from=5,to=30, by=4))hist(Awareness$TotalCOld, xlab='Overall Score', main='Histogram for Video C', breaks=seq(from=5,to=30, by=4))hist(Awareness$TotalDDEMO, xlab='Overall Score', main='Histogram for Demo D', breaks=seq(from=5,to=30, by=4))
The histograms show reasonably similar distribution shape, with there being a clear right skew for Video A, Video B and Demo D. We will therefore compare medians in the hypotheses for the Friedman test.
Friedman rank sum test
data: AwarenessMatrix
Friedman chi-squared = 41.372, df = 3, p-value = 5.452e-09
Clearly we have a significant result, therefore we proceed with post hoc testing.
Now a new data set is required, i.e. in a different format. We first export the matrix, create a new column containing the Awareness variable with a labelling column